Abstract

Background: Early screening for cancer is arguably one of the greatest public health advances over the last fifty 
years. However, many cancer screening tests are invasive (digital rectal exams), expensive (mammograms, imaging) 
or both (colonoscopies). This has spurred growing interest in developing genomic signatures that can be used for 
cancer diagnosis and prognosis. However, progress has been slowed by heterogeneity in cancer profiles and the 
lack of effective computational prediction tools for this type of data. 

Results: We developed anti-profiles as a first step towards translating experimental findings suggesting that 
stochastic across-sample hyper-variability in the expression of specific genes is a stable and general property of 
cancer into predictive and diagnostic signatures. Using single-chip microarray normalization and quality assessment 
methods, we developed an anti-profile for colon cancer in tissue biopsy samples. To demonstrate the translational 
potential of our findings, we applied the signature developed in the tissue samples, without any further retraining 
or normalization, to screen patients for colon cancer based on genomic measurements from peripheral blood in an 
independent study (AUC of 0.89). This method achieved higher accuracy than the signature underlying 
commercially available peripheral blood screening tests for colon cancer (AUC of 0.81). We also confirmed the 
existence of hyper-variable genes across a range of cancer types and found that a significant proportion of 
tissue-specific genes are hyper-variable in cancer. Based on these observations, we developed a universal cancer 
anti-profile that accurately distinguishes cancer from normal regardless of tissue type (ten-fold cross-validation 
AUC > 0.92). 

Conclusions: We have introduced anti-profiles as a new approach for developing cancer genomic signatures that 
specifically takes advantage of gene expression heterogeneity. We have demonstrated that anti-profiles can be 
successfully applied to develop peripheral-blood based diagnostics for cancer and used anti-profiles to develop a 
highly accurate universal cancer signature. By using single-chip normalization and quality assessment methods, no 
further retraining of signatures developed by the anti-profile approach would be required before their application 
in clinical settings. Our results suggest that anti-profiles may be used to develop inexpensive and non-invasive 
universal cancer screening tests. 
Background

Early detection through mass screening remains one of 
the most effective approaches for reducing health care 
costs [1-4] and mortality [5-10] due to cancer. Despite 
the benefits, there remain significant barriers to cancer 
screening including cost [11,12], lack of insurance 
[11,13], and anxiety or embarrassment about invasive 
procedures [11,12,14]. There are also cancer types for 
which mass-screening tools have not been developed 
[15,16]. Reducing the cost and inconvenience of screen- 
ing may lead to increased early screening and potentially 
improve patient and health economic outcomes. 

Peripheral blood-based genomic signatures are a 
promising avenue for developing non-invasive cancer 
biomarkers [17-21]. However, lack of stable markers in 
cancer gene expression profiles and associated blood 
samples has made finding robust screening biomarkers 
difficult. Here we take advantage of a new theoretical 
model for evolutionary fitness that suggests that a defin- 
ing characteristic of cancer is increased epigenetic and 
gene expression variability [22]. Supporting evidence was 
provided by the observation of increased variability in 
DNA methylation across five different cancer types [23]. 
This model implies a stable characteristic of cancer:
certain genes consistently show higher across-sample
variability in cancer than in normal samples. We present
a statistical technique that leverages this characteristic
by identifying genes that show normal variation in
healthy samples but hyper-variability across tumor
samples, and that uses these genes to predict outcome
with what we refer to as an anti-profile. We define an
anti-profile score for a specific sample as the number of 
hyper-variable genes for which expression in that sample 
falls outside a defined range of normal expression (see 
Methods for details). We illustrate the technique on a 
colon cancer dataset, suggest its potential by predicting 
cancer in a peripheral blood dataset, and explore the 
possibility of a universal cancer predictor by simultan- 
eously predicting outcome with data from 52 cancer 
types. All datasets were obtained from public 
repositories. 

We complement our novel statistical approach with 
new biological insights related to cancer. For the colon 
cancer anti-profiles we incorporate the finding that con- 
sistent decreases in methylation are observed along large 
(5kb - 10Mb) genomic blocks [23]. Specifically, we only 
considered genes that lie inside these blocks for the 
colon cancer anti-profile. For the universal anti-profile 
we incorporated the finding that genes showing epigen- 
etic hyper-variability in cancer tend to be tissue specific 
genes [23-25]. We therefore restricted genes in our uni- 
versal cancer anti-profile to tissue-specific genes. 

Gene expression variability and stochasticity have been 
studied previously in the context of normal populations
[26,27], with recent work exploring the role of genetic
variants in altering expression variation and stochasticity 
[28]. Of particular interest is recent work showing a link
between variation in normal populations and HIV
susceptibility [29]. Only recently, however, has a direct
association between gene expression variability and
disease been studied, in neurological disease [23,30]
and cancer [23]. We show that increased variability in
specific genes is a characteristic feature in many cancer 
types that can be used for prediction. The anti-profile
method we propose here applies, in the predictive
setting, ideas from existing statistical methods developed
to identify and model outliers in gene expression
due to cancer [31,32]. We expand these ideas and
draw on our experience with preprocessing and
normalization of high-throughput expression data to
demonstrate that the anti-profile method yields signatures
that rest on technology ready to be used in clinical
settings (through quality assessment and normalization)
and on a general and stable cancer marker (hyper-variable
expression of specific genes).

Results and discussion 

Gene expression anti-profiles 

We developed the anti-profile method as a simple and 
robust approach to define cancer genomic signatures by 
specifically taking advantage of heterogeneity in cancer. 
An important first step in our approach is to normalize
raw gene expression data, an often-overlooked but key
issue in the development of genomic signatures based
on microarray data. Standard microarray normalization
methods cannot be used when developing clinical diag- 
nostics since they require multiple samples and normal- 
ized values depend on which samples are normalized 
together [33,34]. This means that signatures can only be 
translated to the clinic after independent retraining of 
the signatures is performed with single-sample 
normalization techniques [35]. For all signatures devel- 
oped here, we employ a recently developed single- 
sample normalization technique for microarrays [36] 
and a single-array quality metric [37]. Since signatures 
are developed with single-sample normalization, they 
can be directly used as clinical diagnostics, without fur- 
ther retraining. 

To illustrate our method we developed an expression 
anti-profile that distinguishes colon cancer from normal 
colon in tissue biopsies. We used two independent colon 
cancer studies, performed by different groups [38-40], as 
an example. We designated one of these datasets as a 
training set [38,39] and looked at genes inside reported 
colon methylation change blocks [23] to select those that 
showed hyper-variability within colon cancer samples 
compared to normals. This dataset [38,39] includes
premalignant lesions (adenomas), which we treated as a
separate biological class and excluded from the
following analysis. We applied the resulting anti-profile
signature to the independent test dataset of colon cancer
biopsies [40] to evaluate its accuracy and observed an
area under the ROC curve (AUC) of 0.94 (Figure 1B)
with 76% accuracy. We also performed the same
experiment with training and testing sets reversed and
obtained an AUC of 1.0 with 86% accuracy. We found
that the normal ranges of expression defined independ- 
ently by the two colon cancer experiments were stable 
(Figure 1C), consistent with the observation that these 
genes are tightly regulated in normal tissue. 

To determine the relationship between gene expres- 
sion hyper-variability and CpG DNA methylation hyper- 
variability, we examined a publicly available DNA 
methylation dataset comparing colon cancer with 
matched normal colon tissue on the Illumina
HumanMethylation 27k BeadChip array (see Methods). We
found that there is significant overlap between genes
with hyper-variable expression in colon cancer and genes
with hyper-variable promoter-region CpG methylation
(Fisher's exact test OR=2.41, P=0.005, see Methods). We
then repeated the experiment on the two colon cancer
expression datasets using CpG hyper-variable methylation
to select anti-profile genes and observed worse
prediction performance (AUC=0.84 and AUC=0.97).
Enrichment of hyper-variable CpG DNA methylation in 
blocks of hypo-methylation for this dataset has been 
previously reported [23]. Considering the reduced cover- 
age of the 27k array, which is biased towards CpG 
islands, this prediction result indicates the advantage of 
using hypo-methylation blocks in cancer as a stable and 
comprehensive proxy for methylation hyper-variability in 
the absence of suitable direct measurements. 
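
The overlap test above can be illustrated with a minimal sketch of a one-sided Fisher's exact test on a 2x2 table (rows: expression hyper-variable yes/no; columns: promoter methylation hyper-variable yes/no). The counts below are made up for illustration, and the one-sided alternative is an assumption, since the text does not state which alternative was used. The function reports the simple cross-product odds ratio; R's fisher.test reports a conditional maximum-likelihood estimate instead, so values may differ slightly from those reported here.

```python
from math import comb

def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact test for enrichment on [[a, b], [c, d]].

    Returns (cross-product odds ratio, P(X >= a)) where X follows the
    hypergeometric distribution implied by the fixed table margins.
    """
    row1 = a + b          # genes with hyper-variable expression
    row2 = c + d          # genes without hyper-variable expression
    col1 = a + c          # genes with hyper-variable methylation
    denom = comb(row1 + row2, col1)
    # Sum the probabilities of all tables at least as extreme as observed.
    p_value = sum(
        comb(row1, x) * comb(row2, col1 - x)
        for x in range(a, min(row1, col1) + 1)
    ) / denom
    odds_ratio = (a * d) / (b * c)  # undefined if b or c is zero
    return odds_ratio, p_value

# Hypothetical counts, not taken from the paper.
odds_ratio, p_value = fisher_exact_greater(3, 1, 1, 3)
print(odds_ratio, round(p_value, 4))
```

An odds ratio above 1 with a small P-value indicates that the two gene sets overlap more than expected by chance.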

Colon cancer biomarker in peripheral blood 

We combined the two colon-cancer tissue datasets 
described above and derived one anti-profile signature 
(542 genes). We directly applied the anti-profile derived 
from colon tissue to publicly available peripheral blood 
samples that passed quality assessment (see Methods 
section for details) from cancer patients (n=15) and
normal samples (n=15) without any retraining [19]. We
were able to accurately identify colon cancer samples
from peripheral blood (AUC 0.89, Figure 2 and
Additional file 1: Figure S1). Without retraining, the
accuracy of our anti-profile signature was equivalent to the

Figure 1 The colon cancer anti-profile signature. (A) Normalized gene expression for 15 hyper-variable genes in cancer from two
independent colon cancer datasets [38-40]. Normal samples are shown in green, cancer samples are shown in orange. We define the anti-profile
as the set of genes and a corresponding range of normal expression values for each gene (indicated by dotted lines). Only genes inside colon
methylation blocks [23] were included. The anti-profile score for each sample is the number of genes in the signature that are outside their
defined range of expression. Blue circles highlight expression for one specific cancer sample with an anti-profile score of 9. (B) ROC curves using
the anti-profile method trained on one colon cancer study to score samples from an independent colon cancer study. The anti-profile includes
genes inside colon DNA methylation change blocks where across-sample variance in cancer is at least twice that of normal in the training study.
The anti-profile method is very accurate (AUCs of 0.94 and 1.00). (C) We compare the upper bounds of normal expression (median + 5*median
absolute deviation) as defined by the two independent colon cancer studies and find that ranges are highly consistent.

Figure 2 The colon cancer peripheral blood anti-profile. (A) Plot of the anti-profile scores calculated with the colon tissue anti-profile on an
independent peripheral blood study without retraining [19]. (B) ROC curve and AUC value for the anti-profile prediction on the independent 
peripheral blood study. The anti-profile method achieves an AUC of 0.89 without any retraining. 



training-set accuracy achieved by the 5-gene score
developed by Han et al. [19] directly on these blood samples
(AUC=0.88). Estimated training-set accuracy is known
to overestimate the true out-of-sample accuracy
of a signature [41], so we also tested the five-gene
signature using logistic regression and found its leave-one-out
AUC to be 0.81 (P-value=0.19 for test of differences
between this and the AUC for the anti-profile signature).
We note that further optimization of our anti-profile for 
this task is possible by selecting the optimal number of 
genes based on performance on the peripheral blood 
samples themselves. For instance, a slightly larger
anti-profile signature (650 genes) achieved an AUC of 0.93
(Additional file 1: Figure S1, P-value=0.08 for test of
differences between AUCs). However, this type of
optimization should be based on datasets with more
samples than available here and thus we did not pursue
this avenue further.

Consistent hyper-variability across cancer types 

We collected and manually curated a set of 6,172 cancer
and normal microarray samples in biopsies (n=4,950 and
n=1,222 respectively) from 59 tumor types and 102
normal tissue types across 176 different studies in the Gene
Expression Omnibus (GEO, [42]). Additional file 1:
Table S1 lists the GEO accession number of experiments
included in the dataset after removing samples that did 
not pass the single-chip quality filtering criteria, along 
with the tissue or tumor type and clinical characteristics 
annotated in each experiment. These data represent all 
the clinical information available about each of these 
samples in GEO. For each tissue or tumor type the num- 
ber of biological replicates varied and for seven tissue 
types (adrenal cortex, colon, endometrium, kidney, skin, 
stomach and vulva) we had at least 10 samples of each 
of normal tissue and corresponding tumor type. 

Using these data we developed an anti-profile to pre- 
dict cancer status regardless of tumor or tissue type. 



First, we confirmed that increased across-sample variability
was a general characteristic of cancer (Additional file 1:
Figure S2). We selected hyper-variable genes and defined
normal ranges as described above (details on the few
technical differences are described in the Methods section).
Looking at the top 100 genes that showed consistent 
hyper-variability in cancer we found they were consist- 
ently unexpressed in most normal tissues while consist- 
ently expressed in a few normal tissues (Figure 3A). In 
contrast, no consistency of expression was observed in 
cancer (Figure 3A). We observed the same pattern on an 
independent set of samples not used to define hyper- 
variable genes (Additional file 1: Figure S3). We con- 
firmed that hyper-variable genes in cancer coincide with 
tissue specific genes (Figure 3B and C, Additional file 1: 
Figure S4). Specifically, we found that the set of
tissue-specific genes was enriched for universally
hyper-variable genes (Fisher's exact test, odds ratio 3.1,
P<2.2e-16, Additional file 1: Figure S5). Gene ontology
category enrichment analysis [43] performed on the anti-profile
genes found that categories involving development, 
organ morphogenesis and differentiation are enriched 
with hyper-variable genes (Additional file 1: Table S2). 

Consistent hyper-variability across cancer is not due to 
cellular heterogeneity 

Our results suggest that the universally consistent gene 
expression hyper-variability we report here cannot be 
fully ascribed to cellular heterogeneity in cancer samples. 
For a gene to show hyper-variability in cancer due to 
cellular heterogeneity, it must also be a marker for a 
number of distinct cell types in a heterogeneous cellular 
mixture found in a tumor. However, we found that a 
large number (45%) of universally hyper-variable genes 
in cancer are not consistently expressed in any of the 
normal tissues in our dataset (we say a gene is consist- 
ently expressed for a tissue if it is expressed in at least 
95% of the normal samples for that tissue, see Methods 
Figure 3 Genes with consistent hyper-variability across cancer types. (A) The 100 genes that most consistently show hyper-variability across
cancer types. We first define a normal range of expression using normal samples across all tissue types expecting that normal samples from a few 
tissue types will deviate from this normal range due to the tissue specificity of some genes. Each cell in the matrix indicates the percentage of 
samples of each type in which expression is outside the normal range. We observed that for the majority of genes, the percentage of samples in 
each normal tissue type outside normal range is close to either 0% (most tissues) or 100% (the small number of tissues for which the gene is 
specific). We also observed that in cancer, percentages are consistently away from 0% or 100%, indicating high variability. (B) Principal 
components for normal samples in adrenal cortex, colon, endometrium, kidney, skin, stomach and vulva. Circles illustrate profiles of normal 
expression for each tissue type. (C) Principal components for cancer samples. Increased variability is present in cancer but not manifested as 
multiple tightly defined sub-groups for each cancer type. Instead, we observe lack of regulation in cancer around tightly regulated regions of 
normal expression in each tissue type. The anti-profile method is based on this observation: stochastic departure from tightly regulated normal 
expression in these genes is characteristic in cancer and can be used in predictive settings. 



section). This implies that, for almost half of the univer- 
sally hyper-variable genes in cancer, hyper-variability 
cannot be the result of a heterogeneous mixture of mar- 
kers for different cellular subtypes since these genes are 
usually silenced in normal tissues. Also, while hyper- 
variable genes are enriched in the set of tissue-specific 
genes, we found that the majority of tissue-specific genes
are not consistently hyper-variable (64%). The vast
majority of tissue-specific genes show hyper-variability in a
small number of cancer types (Additional file 1: Figure 
S6) as expected from a histologically heterogeneous 
sample. This suggests that the lack of regulation of the 
particular tissue-specific genes that are consistently 
hyper-variable across cancer types represents a specific 
and general characteristic of cancer. 

We also investigated the relationship between cancer- 
specific hyper-variability and tissue-specificity in the 
seven tissues for which we have sufficient samples of 



both normal and cancer. We found that the vast major- 
ity (95-99%) of hyper-variable genes in each of these 
cancers are not tissue-specific for the corresponding 
normal tissue (Additional file 1: Table S5). However, 
hyper-variable genes in each of these cancers are 
enriched in the set of genes that are specific for the cor- 
responding normal tissue, although the number of genes 
is small. This small set of genes could indeed include 
those where hyper-variability in that specific cancer is 
due to cellular heterogeneity, as normal cells may be 
included in varying proportions in these tumor samples. 
We looked at the relationship between cancer-specific 
differential expression, determined using Empirical 
Bayes methods [44] as fold-change greater than 1 and 
significance less than 10% FDR, and tissue-specificity in 
the same seven tissues. Similar to hyper-variability we 
found that the vast majority of differentially expressed 
genes in each of these cancers are not tissue-specific for 
the corresponding normal tissue. However, in contrast
to hyper-variable genes there is no enrichment of differ- 
entially expressed genes in the set of genes that are spe- 
cific for the corresponding normal tissue. 

Considering this finding, we investigated the relation- 
ship between cellular-specificity and the colon cancer 
peripheral blood result reported above. We determined 
genes that are specific to strictly one of two types of 
lymphocytes for which we had five or more samples in 
our dataset (CD4+ and CD31+ T-cells) and found that 
12% of the genes used in the peripheral blood colon can- 
cer anti-profile fall under this category. Furthermore, 
lymphocyte-specific genes are enriched in the set of 
genes with hyper-variable expression in colon cancer
inside colon cancer hypo-methylation blocks (Fisher's
exact test OR=3.0, P=1.2e-11). This suggests that we
cannot rule out that varying lymphocyte composition in the
peripheral blood samples of colon cancer patients may 
drive the prediction performance of the peripheral blood 
anti-profile. 

Universal cancer anti-profile 

While in the colon cancer anti-profile we restricted
genes to those in the colon-cancer hypo-methylated blocks,
here we used our newly found biological insight: we
restricted the anti-profile to tissue-specific genes, defined
as those genes that are expressed in at least 95% of
samples for at most three tissues using the gene expression
barcode method [45]. With an anti-profile classification
in place, we then quantified the accuracy of this univer- 
sal anti-profile method by performing two cross- 
validation experiments. We first performed a 10-fold 
cross validation experiment where an anti-profile was 
constructed on the training set of each cross-validation 
fold. The procedure was highly accurate with an average 
area under the ROC curve (AUC) across the 10 cross- 
validation experiments of 0.92 (Figure 4A). We next
performed a novel leave-one-tissue-out cross-validation
experiment. For each of the seven tissues for which we had
both normal and cancer samples, we defined an anti- 
profile using samples from the other six tissues and 
scored samples from the tissue being tested (Figure 4B 
and C). For all experiments, the leave-one-tissue-out 
anti-profiles achieved AUCs greater than 0.87. We also 
observed that the set of probes consistently selected 
across cross-validation experiments is very stable, indi- 
cating the robustness of the anti-profile procedure (Add- 
itional file 1: Figure S7). Our analysis indicates that the 
anti-profile method is able to accurately distinguish 
tumors from normal samples on tissues not included in 
its training set and further suggests the universal applic- 
ability of the anti-profile method. 
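
The leave-one-tissue-out procedure can be sketched as follows. This is a minimal illustration on made-up toy data: it reuses the colon cancer recipe (log2 ratio of standard deviations greater than 1, normal range of median +/- 5 median absolute deviations) and omits the tissue-specific gene restriction and the other technical differences of the universal anti-profile. All function names, gene names and tissue labels are hypothetical.

```python
import math
import statistics

def train_anti_profile(samples, k=5.0, threshold=1.0):
    """Select hyper-variable genes and their normal expression ranges.

    samples: list of (tissue, label, {gene: expression}) tuples,
    where label is "normal" or "cancer".
    """
    ranges = {}
    for gene in samples[0][2]:
        normal = [x[gene] for _, label, x in samples if label == "normal"]
        cancer = [x[gene] for _, label, x in samples if label == "cancer"]
        ratio = math.log2(statistics.stdev(cancer) / statistics.stdev(normal))
        if ratio > threshold:
            med = statistics.median(normal)
            mad = statistics.median([abs(v - med) for v in normal])
            ranges[gene] = (med - k * mad, med + k * mad)
    return ranges

def score(expression, ranges):
    """Count selected genes whose expression is outside the normal range."""
    return sum(1 for g, (lo, hi) in ranges.items()
               if not lo <= expression[g] <= hi)

def leave_one_tissue_out(samples):
    """Train on all tissues but one, then score the held-out tissue."""
    results = {}
    for held_out in sorted({tissue for tissue, _, _ in samples}):
        train = [s for s in samples if s[0] != held_out]
        test = [s for s in samples if s[0] == held_out]
        ranges = train_anti_profile(train)
        results[held_out] = [(label, score(x, ranges)) for _, label, x in test]
    return results

def make(tissue, label, g1, g2):
    return (tissue, label, {"g1": g1, "g2": g2})

# Toy data: g1 is tightly regulated in normals and erratic in cancer,
# g2 is stable in both classes.
toy = [
    make("A", "normal", 5.0, 7.0), make("A", "normal", 5.2, 7.3),
    make("A", "cancer", 9.0, 7.1), make("A", "cancer", 1.0, 7.0),
    make("B", "normal", 4.8, 6.7), make("B", "normal", 5.1, 7.0),
    make("B", "cancer", 8.0, 6.9), make("B", "cancer", 2.0, 7.1),
    make("C", "normal", 4.9, 7.2), make("C", "normal", 5.0, 6.8),
    make("C", "cancer", 10.0, 7.0), make("C", "cancer", 0.5, 6.9),
]
results = leave_one_tissue_out(toy)
for tissue, scored in results.items():
    print(tissue, scored)
```

In each fold no sample from the held-out tissue contributes to gene selection or to the normal ranges, which is what allows the experiment to probe generalization to unseen tissue types.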

We used pathological tumor stage or grade annotation 
available for a subset of the samples used in the leave- 



one-tissue-out cross-validation experiment to determine 
if heterogeneity across samples in pathological tumor 
stage or grade may explain the increased gene expres- 
sion variability observed in anti-profile genes used for 
prediction. For each of the leave-one-tissue-out experi- 
ments reported in Figure 4, we used an F-test to find 
genes that are differentially expressed across pathological 
stages or grades (FDR<0.1, Additional file 1: Table S6). 
We then applied Fisher's exact test to determine if the
100-gene anti-profile signature used in the
leave-one-tissue-out experiment overlapped this set of
differentially expressed genes. We found very few genes that are
differentially expressed across pathological tumor stage or
grade for adrenal cortex, stomach and vulva (22, 2 and 4 
respectively). For the remaining experiments no substan- 
tial overlap was observed (OR<2, P-value<0.05). This 
suggests that increased gene expression variability in 
anti-profile genes is not explained by heterogeneity of 
pathological tumor stage or grade in our samples. 

Conclusions 

We have introduced and developed gene expression 
anti-profiles for cancer biomarker discovery. Anti- 
profiles explicitly model increased gene expression vari- 
ability in cancer to define robust and reproducible gene 
expression signatures capable of accurately distinguish- 
ing tumor samples from healthy controls. We have 
developed an anti-profile signature in tissue samples 
from a colon cancer study and validated our signature in 
a second independent validation set, collected by a dif- 
ferent experimental group. We have also applied this sig- 
nature directly, without retraining, to classify patients 
with cancer from normals on the basis of genomic mea- 
surements in peripheral blood. 

We note that Mammaprint [46,47], one of the most 
successful genomic cancer biomarkers, fits our notion of 
an anti-profile: its score is calculated based on the cor- 
relation between the test sample and a good prognosis 
gene expression profile. The failure of other, more com- 
plex genomic methods to outperform Mammaprint may 
be due to their reliance on defining specific cancer pro- 
files [48]. While both Mammaprint and our anti-profile 
method classify samples based on deviation from a refer- 
ence profile, there are two significant differences in the 
way Mammaprint and the anti-profile method achieve 
this: 1) Mammaprint uses tumor samples with good 
prognosis to determine the reference profile. Since these 
are tumor samples many of the genes used in the profile 
may exhibit high variability across the good prognosis 
group. Defining a stable and robust reference profile is 
essential to the success of this type of method. 2) Mam- 
maprint uses correlation to measure how samples devi- 
ate from the reference profile. Our anti-profile method 
instead uses a robust measure where deviation is based 
Figure 4 A stochastic universal cancer classifier. (A) ROC curves for a 10-fold cross-validation experiment classifying any sample as normal or
tumor, where the anti-profile is trained (genes selected and normal regions of expression defined) independently for each fold, and the ROC is 
computed for each testing fold independently. (B) ROC curves for 7 leave-one-tissue-out experiments. In each of the leave-one-tissue-out 
experiments, all samples of that particular type (both normal and tumor) are removed from training sets and then scored using the resulting anti- 
profiles. (C) Cross-validated anti-profile scores for the 7 leave-one-tissue-out experiments. The anti-profile scores can separate a large number of 
tumors from their corresponding normal samples. 



on the number of the genes for which expression falls 
outside normal ranges of expression, which are them- 
selves estimated using robust methods. It may be pos- 
sible to improve on the accuracy of the Mammaprint 
test by adopting a more robust anti-profile based on the 
methods presented in this paper. 

In this case we can use the anti-profile score, that is, 
the number of genes in the anti-profile where expression 
deviates from a normal range of expression obtained 
from normal breast tissue samples, to determine prog- 
nosis. Since this score is based on stable expression in 
normal tissues, it may be more robust than calculating 
correlation to a mean signature for tumors with good 
prognosis that would show high variability. This will 



require that more samples of both normal breast tissue 
and tumor are available on platforms for which robust, 
single-chip normalization methods exist. 

In addition to developing a peripheral blood signature 
for colon cancer, we have confirmed the existence of 
hyper-variable genes across 59 distinct cancer types. We 
also provide evidence of the close relationship between 
hyper-variability across cancer types and tissue-specific 
gene expression. Consistent with these observations on 
tissue-specificity, gene ontology category enrichment 
analysis found that categories involving development, 
organ morphogenesis and differentiation are enriched 
with hyper-variable genes, and the remaining gene
categories enriched with hyper-variable genes involved
cellular interaction with extracellular matrix (e.g.,
adhesion, localization and collagen catabolic processing) or
cell locomotion and cellular component movement.
These results argue strongly against the observed hyper- 
variability being a consequence of sample heterogeneity 
in the cancer samples. 

Incorporating this general result on tissue-specificity 
and hyper-variability we developed anti-profiles able to 
classify tissue samples across multiple tissue and cancer 
types, even when a specific cancer/tissue type is not 
included in the original training set. Our cross-validation 
results suggest that consistent hyper-variability of a small 
set of tissue-specific genes is a stable mark of cancer 
across tissue types. Our results also suggest the potential 
for developing peripheral blood signatures for cancer 
diagnostics on the basis of anti-profiles. 

In the course of achieving these results we have used 
recently developed statistical preprocessing methods to 
remove potential artifacts in a way that is applicable to 
single clinical samples [36]. This is an uncommon
approach, as genomic signatures are typically derived
after applying population-level pre-processing such as
RMA or artifact removal such as surrogate variable
analysis. That we achieve such high accuracy in public data,
which are known to be subject to a broad range of technical
and biological artifacts [37], speaks to the strength of our
methods.

Methods 

Gene expression Affymetrix microarray data 
preprocessing 

We downloaded CEL files for 6,172 Affymetrix 
HGU133plus2 microarrays from 176 studies in the Gene 
Expression Omnibus (GEO, [42]). CEL files were
preprocessed with the frma single-chip procedure [36].
Expression measurements were standardized using Gene
Expression Barcode z-scores [45]. We removed arrays
that were deposited multiple times into the repository 
(Euclidean distance between arrays less than 1). We used 
the GNUSE metric ([37]) to assess array quality and 
removed all arrays from studies with median GNUSE 
greater than 1.25 and removed individual arrays with 
GNUSE greater than 1.2. We did further hand curation
to retain only normal tissue and cancer samples (n=688
and n=4,138 respectively). Additional file 1: Table S1
contains the complete list of studies and samples used in
the reported analyses including the type of clinical
annotation available for each sample. The curated and
preprocessed data are available for download at
http://cbcb.umd.edu/~hcorrada/antiProfiles.
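
The two GNUSE quality filters described above can be sketched as follows. This is a minimal illustration, assuming GNUSE values have already been computed for every array; the study and array identifiers are hypothetical, and applying the study-level filter before the array-level filter is an assumption, since the text does not specify the order.

```python
import statistics

def filter_arrays(gnuse_by_study, study_cutoff=1.25, array_cutoff=1.2):
    """Return identifiers of arrays that pass both GNUSE filters.

    gnuse_by_study: {study_id: {array_id: GNUSE value}}.
    """
    kept = []
    for study, arrays in gnuse_by_study.items():
        # Drop every array from a study whose median GNUSE is too high.
        if statistics.median(list(arrays.values())) > study_cutoff:
            continue
        # Within retained studies, drop individual low-quality arrays.
        kept.extend(a for a, gnuse in arrays.items() if gnuse <= array_cutoff)
    return kept

# Hypothetical GNUSE values for two studies.
toy_gnuse = {
    "studyA": {"a1": 1.0, "a2": 1.1, "a3": 1.3},
    "studyB": {"b1": 1.3, "b2": 1.4, "b3": 1.0},
}
kept = filter_arrays(toy_gnuse)
print(kept)
```

Here studyB is removed as a whole (median GNUSE 1.3), including its one acceptable array, while studyA loses only the array above the per-array cutoff.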

Colon cancer anti-profile 

We used the HGU133plus2 probeset annotation from 
Ensembl (version 15, gene dataset version: GRCh37.p5) 



to map probesets to genes and obtain each genes tran- 
scription start site. In the colon cancer anti-profile, we 
only consider probesets for genes with transcription start 
sites inside blocks of DNA methylation change ([23], 
genomic coordinates available at 
http://www.nature.com/ng/journal/v43/n8/extref/ng.865-S2.xls). 
We use the log ratio of standard deviations across samples 
as a statistic to select probesets for the anti-profile: 
$r_g = \log_2(s_{gc} / s_{gn})$, where $s_{gc}$ is the 
across-sample standard deviation of expression for probeset g 
among the colon tumor samples, and $s_{gn}$ is the 
across-sample standard deviation of expression for probeset g 
among the normal samples. The anti-profile includes probesets 
with $r_g > 1$ (variability in cancer is more than twice that 
of normal). 
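
As an illustration of this selection step, the following Python sketch computes the log2 ratio of standard deviations for each probeset and keeps those above the threshold. The function names and input layout are hypothetical, and the use of the sample (n-1) standard deviation is an assumption, since the estimator is not specified in the text.

```python
import math
from statistics import stdev


def log2_sd_ratio(cancer_values, normal_values):
    """r_g = log2(s_gc / s_gn) for one probeset (sample standard deviations)."""
    return math.log2(stdev(cancer_values) / stdev(normal_values))


def select_hypervariable(cancer, normal, threshold=1.0):
    """Return {probeset: r_g} for probesets with r_g above threshold.

    cancer, normal: dicts mapping probeset id -> list of expression values.
    Probesets with zero variability in either group are skipped because
    the log ratio is undefined for them (an assumption of this sketch).
    """
    selected = {}
    for probeset, cancer_values in cancer.items():
        normal_values = normal[probeset]
        if stdev(cancer_values) == 0 or stdev(normal_values) == 0:
            continue
        r_g = log2_sd_ratio(cancer_values, normal_values)
        if r_g > threshold:
            selected[probeset] = r_g
    return selected
```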

Normal regions of expression are defined for each pro- 
beset as median expression +/- 5 median absolute 
deviations of expression in the normal samples. We 
found that our results are quite insensitive to the choice 
of median absolute deviation multiplier (Additional file 
1: Figure S8). The anti-profile score for a specific sample 
is then the number of probesets outside their respective 
range of normal expression. A cutoff score can be used 
to turn the anti-profile score into a classification: scores 
greater than the cutoff are classified as cancer, scores 
lower than the cutoff are classified as normal. A specific 
cutoff can be determined according to a prescribed ob- 
jective: e.g. maximize accuracy, or maximize specificity 
at a given sensitivity in a held-aside test set. We used 
area under the ROC curve [49] to measure anti-profile 
accuracy and the DeLong method [50] as implemented 
in the pROC package [51] to test for differences in AUC. 
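
The scoring rule and the AUC summary can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the function names are hypothetical, the median absolute deviation is left unscaled, a score exactly equal to the cutoff is assigned to normal (the text does not specify this case), and the AUC is computed through its rank (Mann-Whitney) formulation rather than with the pROC package.

```python
from statistics import median


def normal_region(normal_values, k=5.0):
    """Median +/- k unscaled median absolute deviations of the normal samples."""
    center = median(normal_values)
    mad = median([abs(v - center) for v in normal_values])
    return center - k * mad, center + k * mad


def anti_profile_score(sample, regions):
    """Count probesets whose expression falls outside the normal region.

    sample: dict probeset -> expression value
    regions: dict probeset -> (low, high) from normal_region
    """
    score = 0
    for probeset, (low, high) in regions.items():
        if not low <= sample[probeset] <= high:
            score += 1
    return score


def classify(score, cutoff):
    """Scores above the cutoff are called cancer; ties go to normal (assumption)."""
    return "cancer" if score > cutoff else "normal"


def auc(cancer_scores, normal_scores):
    """Fraction of (cancer, normal) pairs ranked correctly; ties count as half."""
    wins = 0.0
    for c in cancer_scores:
        for n in normal_scores:
            if c > n:
                wins += 1.0
            elif c == n:
                wins += 0.5
    return wins / (len(cancer_scores) * len(normal_scores))
```

Because the regions are fixed once they have been estimated from normal samples, scoring a new sample requires no further normalization against other samples.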

Colon cancer Illumina HumanMethylation 27k array 

We downloaded a publicly available dataset of methylation 
levels of 22 matched colon normal/tumor samples 
assayed using Illumina's HumanMethylation 27k array 
(GEO accession number GSE17648). Methylation 
measurements were used with no further preprocessing. 
Differences in methylation variability were determined 
using an F-test, with significance determined at a 1% false 
discovery rate. For each probeset in our expression data 
we found the CpG inside its promoter region (defined 
as 1000bp upstream and 250bp downstream) nearest to 
the transcription start site. We determined significant 
expression hyper-variability using an F-test at a 1% false 
discovery rate, and used these calls to determine the overlap 
between expression hyper-variability and DNA methylation 
hyper-variability. 

Colon cancer peripheral blood data 

We obtained peripheral blood Affymetrix HGU133plus2 
samples from colon cancer patients and healthy controls 
([19] from the study authors, and [52] from GEO with 
accession number GSE10715). Arrays were preprocessed 
with fRMA and normalized using the gene expression 
barcode. Arrays with GNUSE values >1.2 were removed, 
which left 15 colon cancer samples and 15 normal 
samples from the first study. The median GNUSE for the 
second study was 1.46, so that study was not included in 
the analysis (all but three of its cancer samples had 
GNUSE >1.2). 

Colon cancer peripheral blood anti-profile signature 

We defined the anti-profile from colon tissue by com- 
bining samples from the two colon cancer biopsy data- 
sets used in the Gene Expression Antiprofiles Results 
section [38,40,52]. Probesets were included in the anti- 
profile and regions of normal expression defined as 
described above. No retraining was done to test on the 
blood dataset. The list of genes and corresponding me- 
dian and median absolute deviation of expression are 
given in Additional file 2: Table S3. 

To assess how sensitive the accuracy of the peripheral 
blood signature is to signature size, we tested signatures 
of increasing size, with genes included in order of 
decreasing hyper-variability across colon tumor samples 
(Additional file 1: Figure S1). While the signature 
reported in the manuscript obtained an AUC of 0.89, 
similar AUCs are obtained with signatures of about 
500-2000 genes inside blocks, indicating that the 
prediction result reported in the manuscript is not very 
sensitive to the specific signature size chosen. To ascertain 
significance of the prediction results obtained we per- 
formed a randomization test: for each signature size, we 
generated 1000 signatures with randomly selected sub- 
sets of genes of the appropriate size to build each anti- 
profile. Ranges of normal expression do not change since 
these are defined from the colon tissue dataset. We used 
the proportion of random signatures obtaining an AUC 
greater than or equal to that of the anti-profile of the 
corresponding size as a measure of uncertainty (an empirical 
p-value). Signatures that showed significantly high AUC were 
those that include about 500-2000 of the top hyper-variable 
genes inside methylation blocks. 
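
The randomization test can be outlined as below. This is a hedged sketch rather than the code used for the reported results: `auc_of_signature` is a hypothetical callback that builds an anti-profile from a list of genes and returns its AUC on the blood samples, and the fixed seed is included only to make the example reproducible.

```python
import random


def randomization_p_value(observed_auc, candidate_genes, size,
                          auc_of_signature, n_random=1000, seed=0):
    """Proportion of random signatures whose AUC reaches the observed AUC.

    observed_auc: AUC of the anti-profile of this size
    candidate_genes: list of genes that random signatures are drawn from
    size: number of genes in each random signature
    auc_of_signature: hypothetical callback, list of genes -> AUC
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_random):
        signature = rng.sample(candidate_genes, size)
        if auc_of_signature(signature) >= observed_auc:
            hits += 1
    return hits / n_random
```

A small proportion indicates that randomly chosen gene sets of the same size rarely match the accuracy of the hyper-variability-ranked signature.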

Universal hyper-variable genes in cancer 

To determine probesets that exhibit hypervariable ex- 
pression in cancer we compute a variance ratio statistic 
across multiple tissues. We restrict this computation to 
tissues and cancer types with more than 10 samples in 
our dataset (list given in Figure 3). We compute the 
standard deviation of expression for probeset g separately 
for each tissue t ($s_{gt}$) and cancer type c ($s_{gc}$). We 
define the variance ratio statistic $u_g$ (Additional file 1: 
Figure S2) as 
$u_g = \log_2(\mathrm{mean}_c\, s_{gc} / \mathrm{mean}_t\, s_{gt})$. 
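
For a single probeset, the universal variance ratio can be sketched as follows. The function name and input layout are hypothetical, and the sample (n-1) standard deviation is an assumption; groups with 10 or fewer samples are ignored, following the restriction stated above.

```python
import math
from statistics import mean, stdev


def universal_variance_ratio(cancer_by_type, normal_by_tissue, min_samples=10):
    """u_g = log2(mean_c s_gc / mean_t s_gt) for one probeset.

    cancer_by_type: dict cancer type -> list of expression values
    normal_by_tissue: dict normal tissue -> list of expression values
    Only groups with more than min_samples samples contribute.
    """
    cancer_sds = [stdev(v) for v in cancer_by_type.values() if len(v) > min_samples]
    normal_sds = [stdev(v) for v in normal_by_tissue.values() if len(v) > min_samples]
    return math.log2(mean(cancer_sds) / mean(normal_sds))
```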

To define the universal normal range of expression we 
use a similar method: we compute median expression 
for each gene g on each tissue t separately ($m_{gt}$) along 
with the median absolute deviation ($\mathrm{mad}_{gt}$). The 
universal range is then defined as 
$m_g \pm 5 \cdot \mathrm{mad}_g$, where 
$m_g = \mathrm{median}_t(m_{gt})$ and 
$\mathrm{mad}_g = \mathrm{median}_t(\mathrm{mad}_{gt})$. The list 
of hyper-variable genes ($u_g > 1$) and associated median 
expression and median absolute deviation of expression are 
provided in Additional file 3: Table S4. 
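
The universal normal range for one gene can be sketched as a median of per-tissue medians and a median of per-tissue median absolute deviations. The function name and input layout are hypothetical, and the median absolute deviation is left unscaled, which is an assumption of this sketch.

```python
from statistics import median


def universal_normal_range(normal_by_tissue, k=5.0):
    """Return (m_g - k * mad_g, m_g + k * mad_g) for one gene.

    normal_by_tissue: dict tissue -> list of expression values in normal samples
    m_g is the median over tissues of the per-tissue medians m_gt;
    mad_g is the median over tissues of the per-tissue MADs mad_gt.
    """
    tissue_medians = []
    tissue_mads = []
    for values in normal_by_tissue.values():
        m_gt = median(values)
        tissue_medians.append(m_gt)
        tissue_mads.append(median([abs(v - m_gt) for v in values]))
    m_g = median(tissue_medians)
    mad_g = median(tissue_mads)
    return m_g - k * mad_g, m_g + k * mad_g
```

Taking medians across tissues keeps a single tissue with many samples from dominating the range.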

Defining tissue-specific genes 

To define tissue-specific genes, we tabulated the number 
of samples in which a gene is expressed (defined as gene 
expression barcode z-score greater than 2.54) for each 
tissue in our dataset with more than 10 normal samples. 
Tissue-specific genes were defined as those in which the 
gene is expressed in more than 95% of the samples of at 
most three tissues. Fisher's exact test was used to 
determine enrichment of hyper-variable genes in the set of 
tissue-specific genes (Additional file 1: Figure S5). 
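
The tissue-specificity rule can be sketched as below. The function name and nested-dictionary layout are hypothetical. The sketch also assumes that a gene must be consistently expressed in at least one tissue to count as tissue-specific, which the definition above does not state explicitly.

```python
def tissue_specific_genes(zscores, z_cutoff=2.54, min_fraction=0.95,
                          max_tissues=3, min_samples=10):
    """Return the set of genes expressed in at most max_tissues tissues.

    zscores: dict tissue -> dict gene -> list of barcode z-scores,
    one per normal sample. A gene is called expressed in a tissue when
    more than min_fraction of that tissue's samples exceed z_cutoff.
    Tissues with min_samples or fewer samples are ignored.
    """
    tissue_counts = {}
    for genes in zscores.values():
        for gene, values in genes.items():
            tissue_counts.setdefault(gene, 0)
            if len(values) <= min_samples:
                continue
            expressed = sum(1 for z in values if z > z_cutoff)
            if expressed / len(values) > min_fraction:
                tissue_counts[gene] += 1
    # Assumption: require expression in at least one tissue.
    return {g for g, n in tissue_counts.items() if 1 <= n <= max_tissues}
```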

Gene ontology category enrichment analysis 

Gene ontology (GO) enrichment analysis was done using 
a hypergeometric test for association between 
hyper-variable genes (defined as $u_g > 1$) and GO terms. We 
used the implementation in the Bioconductor GOstats 
package ([43]). We used the q-value ([53]) method to control 
for multiple hypothesis testing and report enriched 
categories with q-value < 0.05 in Additional file 1: Table S2. 
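
For readers unfamiliar with the test, the upper-tail hypergeometric probability for a single GO term can be written directly. This sketch is not the GOstats implementation used for the reported results; the function and argument names are hypothetical, and it covers only the basic test for one term without any multiple-testing correction.

```python
from math import comb


def hypergeometric_p_value(universe, in_category, selected, overlap):
    """P(X >= overlap) when drawing `selected` genes without replacement.

    universe: total number of genes tested
    in_category: genes annotated to the GO term
    selected: number of hyper-variable genes
    overlap: hyper-variable genes annotated to the GO term
    """
    total = comb(universe, selected)
    upper = min(in_category, selected)
    tail = sum(
        comb(in_category, k) * comb(universe - in_category, selected - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total
```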

Cross-validation experiments 

We performed two types of cross-validation experiments 
to quantify the accuracy of universal cancer anti-profiles. 
The first was ten-fold cross-validation: data were 
randomly split into 10 equal-sized subsets, retaining the 
proportion of normal and cancer samples from the full 
dataset in each subset. Each of the 10 subsets (or folds) 
was used in turn as a test set, scored using an 
anti-profile trained on the remaining 90% of the data. 
Training includes all steps: 1) filtering to include only 
tissue-specific probesets, 2) computing the universal 
variance ratio $u_g$, 3) selecting the top 100 genes based 
on the ratio statistic, and 4) computing the universal 
normal range of expression. 
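
A stratified split of this kind can be sketched as follows. The function name is hypothetical and the fixed seed is only for reproducibility of the example. Fold sizes are exactly equal only when each class size is a multiple of the number of folds; otherwise they differ by a few samples.

```python
import random


def stratified_folds(labels, n_folds=10, seed=0):
    """Split sample indices into n_folds folds, preserving class proportions.

    labels: list of class labels (for example "normal" or "cancer"),
    one per sample. Returns a list of n_folds lists of sample indices.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    folds = [[] for _ in range(n_folds)]
    for members in by_class.values():
        rng.shuffle(members)
        # Deal shuffled members round-robin so each fold gets its share.
        for position, index in enumerate(members):
            folds[position % n_folds].append(index)
    return folds
```

Each fold then serves once as the test set, while probeset filtering, ratio computation, gene selection and normal-range estimation are repeated on the other nine folds only.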

The other type of cross-validation experiment was car- 
ried out on the 7 tissues for which we had at least 10 
samples each of normal tissue and tumor. For each 
tissue type, we performed a leave-one-tissue-out 
experiment by using all of its samples (normal and 
corresponding tumor type) as the test set and scoring them 
using an anti-profile trained on the remaining data. This 
ensures that 
no samples from the corresponding tissue (normal or 
cancer) are included in the training set. Again, all steps 
required to train the anti-profile were done completely 
for each leave-one-tissue-out fold. 
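
The leave-one-tissue-out splits can be sketched as below. The function name, the `(tissue, label)` pair layout and the label strings "normal" and "tumor" are hypothetical choices made for this illustration.

```python
from collections import Counter


def leave_one_tissue_out(samples, min_per_class=10):
    """Build (tissue, train_indices, test_indices) splits.

    samples: list of (tissue, label) pairs, label being "normal" or "tumor".
    Only tissues with at least min_per_class normal and min_per_class tumor
    samples are held out; all other samples stay available for training.
    """
    counts = Counter(samples)
    splits = []
    for tissue in sorted({t for t, _ in samples}):
        if (counts[(tissue, "normal")] < min_per_class
                or counts[(tissue, "tumor")] < min_per_class):
            continue
        test = [i for i, (t, _) in enumerate(samples) if t == tissue]
        train = [i for i, (t, _) in enumerate(samples) if t != tissue]
        splits.append((tissue, train, test))
    return splits
```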

To classify a new sample we count the number of 
anti-profile genes whose expression falls outside 
their normal range (Figure 2A). A large number of genes 
with expression outside the normal range, corresponding 
to a high anti-profile score, is indicative of cancer. To 
develop a predictor for new samples, a cutoff must be 
defined on the number of genes outside the normal 
range. If the anti-profile score is less than the cutoff, the 
sample is classified as normal; if it is greater than the 
cutoff, the sample is classified as cancer. 